
Introduction to Python for Data Science

Lecture 06: Strings and Regular Expressions

Student Name: Live Lecture HTML

Student ID: 220217


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and also distributed via email. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into Python - when required. This learning approach will allow you to spend the time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank slate approach here to Python and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to get you from some potential scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this sixth lecture in a series of seven. We've previously covered data structures, data wrangling, plotting and flow control but today we will return to the DataFrame object to dive further into an aspect of data wrangling with string manipulation and regular expressions.

At the end of this lecture we will aim to have covered the following topics:

  1. Regular expressions
  2. Methods of the string library
  3. Regex examples in action
  4. Regex in pandas

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will be used to practice string manipulation and regular expressions, both in base Python and with the pandas package.

0.4.1 Dataset 1: human_microbiome_project_otu_taxa_table_subset.csv

This is a subset of the taxa table we used back in lecture 03. We'll be using it to look at regular expressions with the pandas package and DataFrames.

0.4.2 Dataset 2: sequences.tsv

This is a small example dataset that we'll use to practice some string manipulation and DNA sequence formatting.


0.5.0 Packages used in this lesson

IPython's InteractiveShell will be accessed just to set the behaviour we want for IPython so that we can see multiple outputs per code cell.

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

pandas provides the DataFrame class that allows us to format and play with data in a tabular format.

re provides regular expression matching functions similar to those found in the programming language Perl.


1.0.0 Regular expressions

"A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as 'write only', because regular expressions are easier to write than to read/understand. And they are not particularly easy to write." - Jenny Bryan (Tidyverse software engineer)

RegEx is a very powerful and sophisticated way to perform string manipulation. Common uses of string manipulation are: searching, replacing or removing (making substitutions), and splitting or combining substrings.

So why do regular expressions or 'RegEx' get so much flak if they are so powerful for text matching? Scary example: how to verify an email address in different programming languages http://emailregex.com/.

Writing/reading RegEx is definitely one of those situations where you should annotate your code. There are many terrifying urban legends about people coming back to their code and having no idea what their code means.

xkcd-1171-perl_problems.png

1.0.1 Regex example with Microsoft word

For our first regex exercise, use Microsoft Word to open the file "ScallionCakes.docx". This file contains the beginnings of some pre-amble text for a recipe blog. Here is what we are going to do:

The key to this example, of course, is being more specific with what we want to search for and replace. To accomplish that we turn to RegEx which offers a structured language/syntax for identifying string patterns. Let's take a closer look at it now.

1.1.0 Python’s Regular Expressions (re) library

re is Python's built-in module for regular expressions. The re module offers a set of functions that facilitate searching a string for a match. These functions include:

Function Description
findall() Returns a list containing all matches
finditer() Returns an iterator containing all matches
search() Returns a Match object if there is a match anywhere in the string
split() Returns a list where the string has been split at each match
sub() Replaces one or many matches with a string
escape() Escapes any special characters in a pattern that may be unavoidably present

The re module uses a set of metacharacters and special sequences designed to match strings based on features such as digits only, non-digits only, letters only, etc. These are all contained in the next three tables.

1.1.1. Metacharacters

Metacharacters are characters with a special meaning when interpreted as a regular expression:

Character Description Example
[ ] A set of characters "[a-m]"
\ Signals a special sequence (can also be used to escape special characters) "\d"
. Any character (except the newline character) "he..o"
^ The search string starts with "^hello"
$ The search string ends with "world$"
* Zero or more occurrences "aix*"
? Zero or one occurrence "aix?"
+ One or more occurrences "aix+"
{n} Exactly the specified number of occurrences "al{2}"
{,n} Match between 0 and n repetitions "al{,3}"
{m,n} Match between m and n repetitions "al{3,10}"
{m,} Match between m and infinite repetitions "al{3,}"
| Either or "falls|stays"
( ) Capture and group "(exact*matches)"

1.1.2 Special Sequences

A special sequence is an escape character, \, followed by one of the characters in the list below, and has a special meaning:

Character Description Example
\A Returns a match if the specified characters are at the beginning of the string "\AThe"
\b Returns a match where the specified characters are at the beginning or at the end of a word r"\bain" r"ain\b"
\B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" r"ain\B"
\d Returns a match where the string contains digits (numbers from 0-9) "\d"
\D Returns a match where the string DOES NOT contain digits "\D"
\s Returns a match where the string contains a whitespace character. This includes spaces, \t, \n, etc. "\s"
\S Returns a match where the string DOES NOT contain a white space character "\S"
\w Returns a match where the string contains any word characters (letters a-z and A-Z, digits from 0-9, and the underscore _ character) "\w"
\W Returns a match where the string DOES NOT contain any word characters "\W"
\Z Returns a match if the specified characters are at the end of the string "word\Z"

1.1.3 Sets

A set is a group of characters inside a pair of square brackets [ ] with a special meaning:

Set Description
[arn] Returns a match where one of the specified characters (a, r, or n) are present
[a-n] Returns a match for any lower case character, alphabetically between a and n
[^arn] Returns a match for any character EXCEPT a, r, and n. Therefore ^ negates but only within a set
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9] Returns a match for any digit between 0 and 9
[0-5][0-9] Returns a match for any two-digit numbers from 00 and 59
[a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case
[+] In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string

1.1.4 Troubleshooting RegEx

LearningRegex.jpg
We all have trouble with RegEx.

Trouble-shooting RegEx can take time and sometimes you're better off working with a simulator that can help you out. Here are a couple of helpful sites where you can test your RegEx patterns:

https://regex101.com/
https://regexr.com/

RegexBooks.png
Otherwise, know that you are NOT alone in your struggle!

1.2.0 Exploring the re RegEx library functions

Time to explore re's functions (https://docs.python.org/3/library/re.html)


1.2.1 findall() helps to pattern-match within a string

Sometimes we may want to find the occurrence of a specific pattern within a larger string. A good example is searching for sequence motifs in a larger block of genomic sequence. If we were working with a text editor like Microsoft Word we would use the Find tool and provide the pattern. Any matches would be listed somewhere for us to look at further.

In Python, the findall(pattern, string) function returns all non-overlapping matches of pattern within string as a list of strings. When using capture groups, this will return a list of tuples if the pattern has more than one group. Empty matches will also be returned in the result. Much like a text editor, findall() simply returns a list of the matches it finds, although this doesn't provide their positions within the string!

However, by providing regular expressions to the pattern parameters, we can identify more complex matching patterns, giving us more power and flexibility than a text editor. Let's start with a simple example.
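Since the code cells are blank in this handout, here is a minimal sketch of findall() using a made-up sequence (not from the lecture data):

```python
import re

seq = "ATGAAATGCCCATGTT"          # hypothetical sequence, not from the lecture data
hits = re.findall("ATG", seq)     # all non-overlapping matches, in order
print(hits)
```

Each element of the returned list is just the matched text; no positions are reported.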


1.2.2 search() returns a Match object

As we see above we can get back the complex matches from a pattern but we aren't given their position within our string. If we are interested in where a pattern match first occurs in our search string - which is often what we want - then we can use the search() function.

The search(pattern, string) function searches the parameter string for a match to pattern, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned.

Note that there is a similar function match() which only searches for a pattern starting at the first position of an input string. This also returns a Match object. We'll talk more about that soon.

Let's take a look at how search() works.
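A sketch of what a completed cell might look like, again with a hypothetical sequence:

```python
import re

match = re.search("TGC", "ATGAAATGCCCATG")   # hypothetical sequence
print(match.group())   # the matched text
print(match.start())   # position of the first occurrence
```

If no match is found, search() returns None, so it's worth checking the result before calling its methods.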


1.2.3 Break your string up by a pattern using the split() function

Most commonly you will encounter a dataset where columns of variables contain multiple data values concatenated by a character like :, ; or even \t. You may also want to break up sequence data based on specific patterns etc. The split(pattern, string) function returns a list where the string has been split at each match.

The optional parameter maxsplit takes a non-zero value to determine the maximum number of splits to produce, with the remaining unsplit string text being appended as the last element of the return list. The default value of maxsplit is 0.

Let's try an example using the maxsplit parameter.
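One possible version of that example, using a made-up taxonomy-style string:

```python
import re

record = "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales"  # hypothetical
all_parts = re.split(";", record)
first_two = re.split(";", record, maxsplit=2)   # the unsplit remainder is kept whole
print(all_parts)
print(first_two)
```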


1.2.4 Replace a pattern in a string using the sub() function

In some cases, we don't want to just split the information but would rather replace a pattern we are interested in with a different text entry. For this we can use the sub() function, which takes the parameters sub(pattern, repl, string, count).

Let's see what happens when we use the count parameter.

Part of normalizing text is removing special characters.
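A sketch of both ideas together, on a made-up sentence:

```python
import re

text = "well... this, is #messy!"                 # hypothetical text
clean = re.sub(r"[^\w\s]", "", text)              # drop everything that isn't a word or space character
partial = re.sub(r"[^\w\s]", "", text, count=3)   # stop after the first 3 replacements
print(clean)
print(partial)
```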


1.2.5 Taking a closer look at the re.Match object

The re.Match object returned from search() and match() has information on the actual pattern match that has occurred in our search() call. It also has a number of methods and attributes that we'll find useful for accessing that information:

Read more: Learn more about the re.Match object from the Python docs

Let's return to one of our first examples.
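A quick sketch of some commonly used Match methods and attributes, on a hypothetical header string:

```python
import re

m = re.search(r"p\.\s\d+", "JURASSIC PARK p. 103 nt 1-1200")  # hypothetical pattern and string
print(m.group())   # the matched text
print(m.span())    # (start, end) positions of the match
print(m.string)    # the string that was searched
```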


Let's try something more complicated...


1.2.6 Escaping the \ escape character with \ and r

A couple of notes about the code inside search() in the last block

\ (backslash) is known as an "escape" character. It precedes special characters and "escapes" them, meaning that Python - or other programming languages for that matter - will know that those escaped characters are to be interpreted carefully. For instance, $ tells Python that we are using the dollar sign for regex purposes to match the end of a string whereas \$ ensures that we are simply looking for the literal "$" sign. Without the escape character, Python would treat it as a metacharacter. The same principle applies to all special characters.

Often you will need to "escape" the escape character because the Python interpreter also provides special meaning to the \ character. When reading through a string, the Python interpreter uses \ to help it identify special characters like \t (tab), \n (new line), or \' (treat as an apostrophe instead of end-quote).

Since the \ itself is a special character, and regex patterns are strings, then we need to alert Python to the fact that we are using the \ under a separate context as it interprets a string before passing it along to the regular expression interpreter.


backslashes.png

As you can guess, it can be tedious to memorize when you actually need to escape your backslashes; likewise it can be cumbersome to escape every escape character (as you should)! Python provides a convenient solution to this issue.

You can leave all the madness behind by beginning your regex sequences with r. This special tag preceding your regex string will tell Python to treat the string in its raw form without altering any of the backslashes.

In our following example, r is escaping the backslash in \b and \w for us. Try removing r or either of the two \ in our first example and see what kind of error you get. Python provides no hint about what the problem is.
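A sketch of the two equivalent spellings, on a made-up string:

```python
import re

s = "word wild wise"                 # hypothetical
raw = re.findall(r"\bw\w+", s)       # raw string: backslashes pass through untouched
escaped = re.findall("\\bw\\w+", s)  # without r, every backslash must be doubled
print(raw, escaped)
```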


1.3.0 Take the time to comment your RegEx code

To review the when and why of the functions we used:

SquidGames_regex.png
At this point you should have a strong sense of how important it is to annotate your regex code. It can get obscure very quickly.

Remember you may come back to this code in 6 months or 6 days. You'll want a quick reminder about your frame of mind at that time and commenting your code is the simplest way to convey to your future-self, what your past-self was thinking.

Section 1.0.0 Comprehension Question: Take a look at the following coding cell. In the DNA sequence provided, how many times does the pattern TAG occur? What is the first occurrence?

2.0.0 The String library

We've spent a lot of time working on regular expressions to search strings but we haven't really touched much on the String library itself. You are already familiar with several functions from the String library. Let's review what we already know about String objects.

2.1.0 Concatenate strings with the + operator

Let's start with concatenation, which we have used before to print several strings together.
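A minimal sketch of the two printing styles (the strings are made up):

```python
first = "Jurassic"
second = "Park"
plus = first + " " + second    # '+' concatenates; we supply the space ourselves
print(plus)
print(first, second)           # the comma syntax adds the space for us
```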


Notice that when using the comma syntax in the print() function, a second space inside the strings is not necessary. By default, the print() function uses a space to separate its inputs.

2.2.0 Subset strings by their 0-indexed position

Remember that we can index strings much like we do arrays and lists using a 0-indexed positioning system.
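A short sketch on a hypothetical string:

```python
s = "GATTACA"    # hypothetical
print(s[0])      # first character
print(s[1:4])    # positions 1 up to (not including) 4
print(s[-1])     # negative indices count from the right
```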


2.2.1 String striding slices strings in a stepped fashion

Up until now, all of our slicing has been to retrieve all elements between two points. We can, however, apply a step value k to our String index which will return every kth element within our slice. A negative value of k will result in indexing in the reverse direction from the right end of the string.

This form of slicing is also known as striding. Let's see some examples.
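A sketch of striding, using the EcoRI recognition site as a stand-in sequence:

```python
s = "GAATTC"          # EcoRI's recognition site, used here as a toy example
every_2nd = s[::2]    # step k=2: every 2nd character
reversed_s = s[::-1]  # a negative step indexes from the right end
print(every_2nd, reversed_s)
```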


Palindromes are words that can be read in both directions without changing the meaning. Specific palindromic sequences are recognized as restriction sites by many endonucleases.


2.3.0 Break a string up using the split() method

This looks a lot like the re.split() function, but the split() method breaks up your string from the left side using a constant separator. This results in faster processing as there is less overhead than in dealing with regular expressions. For simple and constant splits, use this method instead. Recall that this method also has a maxsplit parameter.

There is a second method rsplit() which performs like split() except it begins from the right side of the string object.

In the example below, string_1 will be split using the letter "o", which is removed from the string.
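A sketch with a hypothetical string_1:

```python
string_1 = "codons from the genome"      # hypothetical
print(string_1.split("o"))               # 'o' is removed at every split point
print(string_1.split("o", 1))            # maxsplit=1: only the first 'o' splits
print(string_1.rsplit("o", 1))           # rsplit() starts from the right instead
```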


2.4.0 Concatenate strings in a list using the join() method

We encountered this method briefly in the first lecture and recently in your section 1.0.0 comprehension question! Recall that the join() method takes all the items from an input iterable and joins them into a single string. This method is called from a string object which acts as the separator.
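A minimal sketch with a made-up codon list:

```python
codons = ["AUG", "GCU", "UAA"]        # hypothetical list of codons
print("-".join(codons))               # the string we call join() on is the separator
print("".join(codons))                # an empty separator glues the list back together
```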


2.5.0 Use the splitlines() method to remove line boundaries and return a list

As we've seen in other lectures, the \n character is used to denote a new line in text. You can use the splitlines() method to break up a string by the \n character and return a list where each element is a line of text. The parameter keepends specifies if line breaks should be included as part of each line.

Note below that we've used the ''' triple quote as a way to make our string span multiple lines. This is purely for the purpose of making our code more readable, and it can also be used to make multi-line strings. This will, however, insert an additional \n character every time a new line is started.
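A sketch with a made-up two-line passage:

```python
passage = '''To be, or not to be,
that is the question'''          # hypothetical; the triple quote embeds a \n
print(passage.splitlines())      # line breaks are removed
print(passage.splitlines(True))  # keepends=True retains the '\n'
```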

Section 2.0.0 Comprehension Question: Complete the for-loop below to iterate through each line in othello_list and split it further by the presence of punctuation, i.e. ".", "!", ";", and ",".

3.0.0 Regex in action

Now, a regex example relevant to biologists: A string of DNA.

Have you seen a fasta file before? They are the standard format to represent nucleotide and amino acid sequences using single letter codes in text files, and look like the string dino code below:

*** This piece of DNA is from the book Jurassic Park, and was supposed to be dinosaur DNA, but is actually just a cloning vector. Bummer.

dino.notdino.png


3.1.0 Removing substrings with regex

This string is in FASTA format, but we don't need the header; we just want to deal with the DNA sequence. The header begins with '>' and ends with a number, '1200', with a space between the header and the sequence. Let's practice capturing each of these parts of a string, and then we'll make a raw regular expression to remove the entire header.

'>' is at the beginning of the string, so we can anchor our pattern to the start of the string and remove it with the re.sub() function.
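A sketch on a toy stand-in for the real header:

```python
import re

header = ">DinoDNA example header"        # toy stand-in for the real header
no_caret = re.sub(">", "", header)        # removes every '>'
anchored = re.sub("^>", "", "> a > b")    # '^' restricts the match to the start
print(no_caret)
print(anchored)
```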


3.1.1 Use lstrip() and rstrip() to remove characters

In case you are only interested in removing the characters beginning at the first position or end (last position) of string, you can use the lstrip() (left) and rstrip() (right) methods respectively. Given a set of characters, they will continue to remove leading or trailing sequence that matches those characters.

Note that both lstrip() and rstrip() will treat our string pattern char equivalent to a set ie [char]. Let's see what effect that has on our output.
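A sketch on a made-up header string:

```python
s = ">>>DinoDNA 1200   "        # hypothetical
print(s.lstrip(">"))            # strips every leading '>' character
print(s.rstrip())               # no argument: trailing whitespace is removed
print(s.lstrip(">Dino"))        # argument is a SET of characters, i.e. [>Dino]
```

Note the last call also strips the second 'D' but stops at the capital 'N', because the argument is treated as a set of individual characters, not a prefix.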


3.1.2 Search for numbers in our header

Next we can search for numbers. The expression [0-9] is looking for any number. Always make sure to check that the pattern you are using gives you the output you expect.

We'll play with the sub() function to see how this works.
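A sketch with a hypothetical header:

```python
import re

header = ">DinoDNA nt 1-1200"               # hypothetical header
print(re.findall("[0-9]", header))          # each digit on its own
print(re.findall("[0-9]+", header))         # '+' keeps runs of digits together
print(re.sub("[0-9]+", "", header))         # or substitute the numbers away
```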

3.1.3 Capture whitespace using the \s pattern

How do we capture spaces? The pattern \s denotes a whitespace character. However, for the backslash to not be used as an escape character (its special function), we need to add another backslash, making our pattern \\s, or we need to use the raw format for our regular expression.

We'll use the string replace() method to replace these with a blank character (or nothing) but this method does not accept regular expressions as input so we must use ' ' as input instead.


3.1.4 Another way to capture those white spaces is with sub()

As you can see from above our replace() call only replaced actual spaces and did not replace any of the newline characters - which we also want to fix. Unlike the replace() method, the re.sub() function interprets \s as any whitespace character! Let's see it in action!
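A side-by-side sketch on a made-up fragment containing both a space and a newline:

```python
import re

seq = "GGC GAT\nTAG"                 # hypothetical: a space AND a newline
print(seq.replace(" ", ""))          # only the literal spaces go; '\n' survives
print(re.sub(r"\s", "", seq))        # \s covers spaces, tabs and newlines
```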


3.1.5 Combine multiple patterns to make a single regular expression

To remove the entire header, we need to combine the patterns we've tested. The header is everything in between '>' and the number '1200' followed by a space. Recall:


3.1.6 Repetition qualifiers are greedy!

Here's our regex pattern: ^>.*[0-9]\s|\n.

Which retrieves: ">DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 ".

Why doesn't it give us: ">DinoDNA from Crichton JURASSIC PARK p. 103 "?

Let's break it down. In our above code we generate the regex: ^>.*[0-9]\s|\n which includes the * qualifier. Recall from our table that the qualifiers *, +, {m,n}, and ? allow us to search for pattern matches in a range from (0, infinity), (1, infinity), (m, n) and (0, 1) respectively. These qualifiers are implemented in a greedy fashion, meaning they will continue matching as many characters as possible within the range.

We can, however, choose a lazy/non-greedy/minimal matching approach where these qualifiers match as few characters as possible. To switch our qualifiers to lazy matching we can add another ?. Yes that's right, there's an additional metacharacter meaning for ? in the context of qualifiers: *?, +?, {m,n}? and ??! The second ? signals to the Regex interpreter that you'd like to implement those qualifiers with lazy matching!

Let's see what happens if we update our regex pattern.
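A sketch of greedy versus lazy matching, using a shortened stand-in for the dino header:

```python
import re

header = ">DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 GCGTAA"  # shortened stand-in
greedy = re.search(r">.*[0-9]\s", header).group()   # runs to the LAST digit + whitespace
lazy = re.search(r">.*?[0-9]\s", header).group()    # *? stops at the FIRST digit + whitespace
print(greedy)
print(lazy)
```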


Just as we predicted, it made the shorter match instead! So remember that the default of these qualifiers is to produce greedy matching!

In our case, of course, we want greedy matching to replace the entire header. Let’s save the dna into its own object.


3.2.0 Extract information from strings with search()

We may also want to retain our header in a separate string rather than just removing it. In that case, recall that we want to use a function like search() which will retain the string that matches our pattern, rather than removing it. We can save this in an object called header and then access the header by calling the group() method.


3.2.1 Capture groups save information about your matches

Now that we understand greedy matching we can also introduce the idea of capture groups. In our above Regex, we know now that from ^>.*[0-9]\s the .* is being matched to DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-120 because the . will accept any character value. Remember, we are greedy matching so the last 0 is matched to our [0-9] portion of the pattern. You may be skeptical but we can prove this with the capture group!

The capture group is denoted by parentheses ( ) in our regex patterns and can be used to capture subgroups from our pattern. This can be useful if you want to re-insert/re-use the information later or break it into multiple columns.

To access capture groups from a match object, use the group(index) method with an index value where 0 holds the entire match, and each capture group is at increasing indices. You can also access all groups as a tuple with the groups() method.

You can even name your capture groups in your regular expression! Let's keep it simple for now.
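A sketch that uses capture groups to prove the greedy-matching claim above:

```python
import re

header = ">DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 "
m = re.search(r"^>(.*)([0-9])(\s)", header)
print(m.group(0))    # group 0 is the whole match
print(m.group(1))    # greedy .* ran all the way to 'nt 1-120'
print(m.group(2))    # [0-9] was left with only the final '0'
print(m.groups())    # every capture group as a tuple
```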


3.3.0 Searching strings for information

Now we can look for patterns in our (dino) DNA! Does this DNA have balanced GC content? We can use the re.findall() function to capture every character that is either a G or a C. We'll start by just viewing the start of our search.
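A sketch with a short made-up sequence standing in for the full dino DNA:

```python
import re

dna = "ATGCGGCCATAT"                 # hypothetical stand-in for the dino sequence
gc = re.findall("[GC]", dna)         # every base that is a G or a C
print(gc[:5])                        # peek at the first few hits
print(len(gc) / len(dna))            # the GC fraction
```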


3.4.0 Replacing parts of your string

Let's translate our dinosaur DNA into mRNA! There are two different ways we can explore to replace multiple patterns at once.

First, we can use the string replace() method through method-chaining on the different bases to replace. Note that in our example we need to call on replace() five times! The extra replace() call allows us to initially create a placeholder symbol for transcribing G to C. Otherwise when we replace C with G in the second replace() call, we'll also change the bases that we just altered.
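One possible ordering of those five replace() calls, assuming (as the text describes) we are complementing a toy template strand:

```python
dna = "ATGC"   # toy template strand
mrna = (dna.replace("A", "U")
           .replace("T", "A")
           .replace("G", "x")    # 'x' is a temporary placeholder...
           .replace("C", "G")    # ...so this step can't clobber the new G's
           .replace("x", "C"))
print(mrna)
```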


3.4.1 Generate a lambda function to help replace multiple patterns/values

A lambda function is a small anonymous function that can take any number of arguments but can only have one expression. We started to look at it in last week's lecture. Let's take a closer look at it now. The lambda function takes the form:

lambda arguments: expression

This syntax allows a developer to quickly generate an anonymous function rather than defining one fully in their code (see upcoming Lecture 07). In doing so, it can be directly substituted as part of a larger expression.

That being said, a lambda function can only contain an expression. An expression evaluates to a single value, whereas a statement is something like a variable assignment with =.

Read more: For more information on expressions, check out the Python reference docs.

Let's run through a quick example.
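A minimal sketch, with a hypothetical gc_fraction helper:

```python
# a lambda is an expression that evaluates to a function object
gc_fraction = lambda seq: (seq.count("G") + seq.count("C")) / len(seq)
print(gc_fraction("ATGC"))   # 0.5
```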


Seems pretty straightforward, right? Normally you would not assign a lambda function to a variable but rather substitute it straight into a function of some kind, so let's take it to the next level.


3.4.2 Use sub() in conjunction with a lambda function

Use the re.sub() function in combination with a lambda function. Recall the function takes the form of sub(pattern, repl, string, count) where:

When sub() does find a match to pattern, it will generate a re.Match object and pass that into our lambda function in repl. Within our lambda function we'll implement a dictionary object. Note that the dictionary object will have access to all of its regular methods and attributes as well. We can use the re.Match.group() method to retrieve the matched string in m.

In our lambda function we will write a dictionary to hold all of the find:replace pairs as key:value pairs. Recall the dictionary method get(key, alt_value) which will return the value associated with key, otherwise it returns alt_value if the key is not found.

With this method there will be no need to generate a placeholder value like before, and it will be much cleaner to alter code later if needed.
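A sketch of the pattern, again on a toy strand:

```python
import re

pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}   # find:replace pairs
mrna = re.sub("[ATGC]", lambda m: pairs.get(m.group(), m.group()), "ATGC")
print(mrna)   # no placeholder gymnastics required
```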


3.5.0 Searching our mRNA sequence for information

Is there even a start codon in this thing? Let's use the re.search() function to check.
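A sketch with a short hypothetical mRNA snippet:

```python
import re

mrna = "GGCAUGCCC"               # hypothetical mRNA snippet
m = re.search("AUG", mrna)
print(m.start() if m else "no start codon found")
```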


3.5.1 Remember to use findall() to look for multiple hits

It might be more useful to know exactly how many possible start codons we have. len(re.findall()) counts the number of matches in the string for our pattern.


3.5.2 Retrieve an iterator of multiple hits with finditer()

So we know there are 9 hits somewhere in our string but we don't exactly know where. A quick way to locate our hits is to use the re.finditer() function, which will return a sequence of re.Match objects as an iterator. Remember iterators from last week? How about list comprehension?

We'll use both in the next example to generate a list of start and end positions for all hits on AUG.
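A sketch with a toy snippet in place of the full sequence:

```python
import re

mrna = "AUGGCAUGAAUG"            # hypothetical mRNA snippet
positions = [(m.start(), m.end()) for m in re.finditer("AUG", mrna)]
print(positions)
```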


3.5.3 Split our mRNA sequence into codons with a skipped range() call

Let's split this string into substrings of codons. Now that we've reviewed list comprehension, we'll use that in conjunction with string striding which we discussed earlier. To accomplish that we'll use the range() function to make an iterator. Recall that we can include a step argument k in our call to range() to produce values at every k-th value in our range.

Recall that our start codon is located at position 89.

How many times do we see a stop codon in our codon list? We can accomplish a count also using list comprehension and the keyword in.
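A sketch of both steps on a short hypothetical reading frame:

```python
mrna = "AUGGCCUAAUGA"            # hypothetical reading frame
codon_list = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
print(codon_list)
stops = len([c for c in codon_list if c in ("UAA", "UAG", "UGA")])
print(stops)
```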


3.5.4 Convert our codon list to a single string

Of course we can convert the list back into a string using the join() method. Let's put a delimiter between them so we can still track individual codons.

Now our codons are stored and delimited in codon_str. Do we have a stop codon anywhere in our reading frame? Let's check with re.findall()

How many stop codons are there in codon_str?


Where are the stop codons in codon_str located?


So our findall() results from codon_str match up with our list comprehension search using codon_list. That's great!

3.5.5 Split codon_str by stop codon sequence with split()

Let's subset codons based on stop codons. This will create 14 genetic sequences (remember that we have 13 stop codons). We can use the re.split() function to accomplish the task. When we print our results we'll also remove the extra . delimiter we've inserted between codons.
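A sketch on a short made-up codon_str (the real one has 13 stops):

```python
import re

codon_str = "AUG.GCC.UAA.CCU.GAU.UGA.AAA"        # hypothetical '.'-delimited codons
genes = re.split("UAA|UAG|UGA", codon_str)
print([g.replace(".", "") for g in genes])        # drop the leftover '.' delimiters
```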


3.5.6 Translate codons into amino acid sequences

Let's go back to our translated codon_str string and further translate them into proteins. First, we need a dictionary where the keys are the translated codons and the values are amino acid codes.

Let's do a mixture of lambda functions and list comprehension again to accomplish our goal on the codon_str object. You'll notice that we don't really need to employ any regular expressions now that we've formatted our data.
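A sketch with a toy three-entry dictionary; the lecture's full translation_aminoacids table would have all 64 codons:

```python
# toy dictionary; the real translation_aminoacids covers every codon
translation_aminoacids = {"AUG": "M", "GCC": "A", "UAA": "*"}
codon_str = "AUG.GCC.UAA"
protein = "".join([translation_aminoacids.get(c, "?") for c in codon_str.split(".")])
print(protein)
```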


And now we have a protein sequence!

Section 3.0.0 Comprehension Question We previously generated a variable codon_list which contained all of the codons from our sequence separated as elements within a list. Is there a way we can translate this list directly to produce a single string of amino acids as we did above? Remember we can already have our translation dictionary pre-coded into the variable translation_aminoacids.

4.0.0 Regular expressions and Pandas

Often we'll encounter or import data that isn't quite in the format we need. In order to wrangle it, we may need to perform some string manipulation based on some pattern matching via regular expressions. As an example, let's return to the human_microbiome_project_otu_taxa_table_subset.csv file from lecture 03. Recall that the dataset had 7 rows, of which row 0 was a semicolon-delimited version of the other 6 rows of data. This time around we'll do the following:

  1. Retain row 0
  2. Melt the DataFrame into 2 columns
  3. Split each observation value into columns using the semicolons
  4. Remove the x__ prefix that occurs as part of each entry

Let's give it a try!


4.1.0 Retain only the first row by indexing

To retain row 0 we'll simply pull the row and re-assign it to the data variable.


4.1.1 Drop columns with the drop() method

Recall that we can remove columns with the drop() method. In this case, we don't need to retain the information in column 0, so remove it. We want to do this before we melt the data, otherwise the column will get copied for each new melted observation.


4.2.0 Melt the data frame into long format with melt()

Recall that in the process of melting we will convert all of the column names to a single column with one row per column name - these become our observations. All of the values in each column will be relocated as values for the appropriate observations in the melted table.
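A sketch of melting on a tiny stand-in DataFrame (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"OTU_1": ["Root;p__Firmicutes"],
                   "OTU_2": ["Root;p__Bacteroidetes"]})   # toy stand-in
melted = df.melt(var_name="OTU", value_name="value")
print(melted)
```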


4.3.0 Split entries by their ; delimiters with pandas.Series.str.split()

Now that we have the table melted, we need to split the value column based on the sequence of text. Let's take a look at an example of the text:

Root;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus

From the data you can see that there is a bit of a pattern emerging. We'll have to make the assumption that if any information is missing, it comes later in the hierarchy sequence for each entry.

We'll take the obvious route and split on the ; first using the pandas.Series.str.split() method. This method splits strings around a given separator/delimiter, just like the built-in str.split() method. However, it takes the following parameters:

If, in the course of splitting, the number of splits doesn't match the shape of the current output DataFrame, it pads the missing values with the None value.
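A sketch of the padding behaviour on two stand-in taxonomy strings of unequal depth:

```python
import pandas as pd

values = pd.Series([
    "Root;p__Firmicutes;c__Bacilli",
    "Root;p__Bacteroidetes",          # shorter entry: one rank missing
])

# expand=True returns a DataFrame with one column per split piece;
# shorter entries are padded out with None.
split_df = values.str.split(";", expand=True)
print(split_df.shape)        # (2, 3)
print(split_df.iloc[1, 2])   # None
```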


4.3.1 Replace the None values using the fillna() method

Looks like we did generate some None values from our split call. We can replace those with NaN values using the fillna() method. This method will recognize the Python object None and replace it with whatever value you want. In this case, we'll use the numpy NaN object.
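On a toy column containing a None (a stand-in for the split output), the call might look like:

```python
import numpy as np
import pandas as pd

# Stand-in for one column of the split output, containing a None.
split_df = pd.DataFrame({"class": ["c__Bacilli", None]})

# fillna() recognizes the None objects and substitutes the value we
# supply; here, the numpy NaN object.
split_df = split_df.fillna(np.nan)
print(split_df.iloc[1, 0])
```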


4.4.0 Bring back the OTU column with the concat() function

We now have our original data_melt DataFrame and the split columns in data_melt_split. We don't want all of either DataFrame, just the OTU information from data_melt and the last 5 columns of data_melt_split. Since we haven't sorted either DataFrame, the observations should still line up correctly, so we can simply concatenate the columns using the pd.concat() function.

Unlike last week when concatenating rows of DataFrames, we'll specify the axis = 1 parameter to concatenate by columns.
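A sketch with two hypothetical fragments standing in for data_melt's OTU column and the split taxonomy columns:

```python
import pandas as pd

# Hypothetical fragments standing in for the two lecture DataFrames.
otu = pd.DataFrame({"OTU": ["OTU_1", "OTU_2"]})
ranks = pd.DataFrame({"phylum": ["Firmicutes", "Bacteroidetes"]})

# axis=1 concatenates side by side, matching rows up by index.
combined = pd.concat([otu, ranks], axis=1)
print(list(combined.columns))  # ['OTU', 'phylum']
print(combined.shape)          # (2, 2)
```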


4.4.1 Rename the columns using proper taxonomic ranks

Now that we have the information columns we desire, we can rename the columns using the columns property.
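For example, with made-up rank names on a one-row stand-in:

```python
import pandas as pd

df = pd.DataFrame([["OTU_1", "Bacteria", "Firmicutes"]])

# Assigning a list to the columns property renames every column at once;
# the list length must match the number of columns.
df.columns = ["OTU", "kingdom", "phylum"]
print(list(df.columns))  # ['OTU', 'kingdom', 'phylum']
```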


4.5.0 Remove the additional x__ prefixes from our values

Next, we need to remove the double underscores and any preceding character indicating the taxonomic rank. We'll use the re.sub() function to first get the regular expression working on a sample string.
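A sketch of one way to write the pattern, tested on a sample string like the one above:

```python
import re

sample = "Root;p__Firmicutes;c__Bacilli;o__Lactobacillales"

# Match one lowercase rank letter followed by two underscores and
# substitute the empty string for it.
cleaned = re.sub(r"[a-z]__", "", sample)
print(cleaned)  # Root;Firmicutes;Bacilli;Lactobacillales
```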


4.5.1 Change values in a DataFrame using the replace() method

Our code works and in theory should cover all instances of the prefix so now we can apply it to the DataFrame. To do so, we'll use the DataFrame.replace() method which has the following relevant parameters:

Overall, the replace() method looks like a DataFrame version of re.sub(), right? Let's give it a try.
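On a small stand-in column, the same pattern applied DataFrame-wide might look like:

```python
import pandas as pd

df = pd.DataFrame({"phylum": ["p__Firmicutes", "p__Bacteroidetes"]})

# With regex=True, replace() applies the pattern to every cell,
# much like re.sub() across the whole DataFrame.
df = df.replace(r"[a-z]__", "", regex=True)
print(df["phylum"].tolist())  # ['Firmicutes', 'Bacteroidetes']
```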


4.5.2 Take advantage of patterns to optimize your Regex!

All done! Is there a place in our steps where we could have optimized the process? Yes; as long as the data is well-formed and consistent, we can expand our regex pattern to treat both the ; and the .__ pattern as delimiters. Doing so saves us a call to replace() altogether.
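One way to sketch the combined pattern (the regex= keyword of str.split() requires pandas 1.4 or later):

```python
import pandas as pd

values = pd.Series(["Root;p__Firmicutes;c__Bacilli"])

# Treat ';' optionally followed by a rank prefix (a lowercase letter plus
# two underscores) as the delimiter, so no prefix survives the split.
split_df = values.str.split(r";(?:[a-z]__)?", expand=True, regex=True)
print(split_df.iloc[0].tolist())  # ['Root', 'Firmicutes', 'Bacilli']
```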


4.6.0 Another data-formatting challenge

Finally, let's read in sequences.tsv, a tab-separated value file (inspect it using spreadsheet software). This file contains two unnamed columns, one with the sequence name and the other with the sequence itself. Our task is to convert each sequence header and sequence string into something resembling the fasta format.

Recall the fasta format looks like:

    >header_with_information
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

Here is a list of what we will do to the file:

  1. Import the file using pandas.
  2. Add a greater-than ('>') at the beginning of each line. Remember that ">" denotes the beginning of a sequence (sequence header) in fasta files.
  3. Replace tabs (\t) with new line (\n).
  4. Write the final output to disk.

Let's get to it!
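A sketch of the import step; here an in-memory io.StringIO stands in for the file on disk, and read_csv's default comma separator is what produces the single-column result described below:

```python
import io
import pandas as pd

# In-memory stand-in for sequences.tsv (two tab-separated columns).
tsv_text = "seq_1\tACGTACGT\nseq_2\tTTGGCCAA\n"

# Reading with the default comma separator leaves each line as a single
# field, so header and sequence stay joined by the tab character.
seq = pd.read_csv(io.StringIO(tsv_text), sep=",", header=None)
print(seq[0].tolist())  # ['seq_1\tACGTACGT', 'seq_2\tTTGGCCAA']
```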


4.6.1 Append the > character to each sequence line

You'll note that instead of separating our columns, we actually just created a single column with our header and sequence joined by the \t separator. Now we'll add '>' at the beginning of each sequence using simple string concatenation, which will broadcast to each value in our DataFrame.
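The broadcasting step, sketched on a stand-in for the single-column DataFrame:

```python
import pandas as pd

# Stand-in for the single-column DataFrame from the previous step.
seq = pd.DataFrame({0: ["seq_1\tACGTACGT", "seq_2\tTTGGCCAA"]})

# Concatenating a plain string broadcasts across every value in the column.
seq[0] = ">" + seq[0]
print(seq[0].tolist())  # ['>seq_1\tACGTACGT', '>seq_2\tTTGGCCAA']
```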


4.6.2 Complete the fasta sequence format by replacing the \t

Looking at the resulting output we know that it's not quite there yet. Remember from above, the fasta format output we are looking for is:

>header_1
sequence_1
>header_2
sequence_2

This translates to essentially >header_1\nsequence_1 for which we already know the right function to help us out with altering our DataFrame: the replace() method.
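On the same kind of stand-in data, the substitution might look like:

```python
import pandas as pd

seq = pd.DataFrame({0: [">seq_1\tACGTACGT", ">seq_2\tTTGGCCAA"]})

# replace() with regex=True swaps the tab inside each cell for a newline.
seq = seq.replace(r"\t", "\n", regex=True)
print(seq.iloc[0, 0])
```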


4.6.3 Write your DataFrame to disk with to_csv()

Recall that we can save our DataFrame to disk using a number of available methods. We'll use the to_csv() method we learned about in lecture 03 to write seq to a file named sequences_line_break.fasta. Since there is only a single column, this ends up writing a fasta-like file with newline characters separating each entry.
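One subtlety worth noting: by default to_csv() follows csv quoting rules and wraps any cell that contains a newline in quotes, so double-check the written file. As a hedged alternative sketch, you can join the column yourself and write it with plain file I/O:

```python
import pandas as pd

seq = pd.DataFrame({0: [">seq_1\nACGTACGT", ">seq_2\nTTGGCCAA"]})

# The csv writer would quote cells containing newlines, so here we join
# the column manually and write the text directly instead.
with open("sequences_line_break.fasta", "w") as handle:
    handle.write("\n".join(seq[0]) + "\n")
```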


Section 4.0.0 Comprehension Question Now that you are much more experienced with RegEx, let's reverse the process we just went through. Use RegEx to convert sequences_line_break.fasta back to its original format, sequences.tsv. Write the resulting DataFrame as sequence_back.tsv.

For more on using Regex in Pandas: Check out this tutorial which may be easier than searching through the Pandas documentation.
Matrix_regex.jpeg
And, that is it. Have fun playing with regex, and, for your own sanity, annotate your code!!!

5.0.0 Class summary

That's our sixth class on Python! You've made it through and we've learned how to manipulate strings and match patterns with regular expressions:

  1. String methods
  2. Regular expressions with the re module
  3. String manipulation in pandas
  4. Regex-based data wrangling with split() and replace()

5.1.0 Submit your completed skeleton notebook (2% of final grade)

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade, but a bonus 0.7% will also be awarded for submissions made within 24 hours of the end of lecture (i.e., 1700 hours the following day).

5.2.0 Post-lecture DataCamp assessment (8% of final grade)

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 (Basic Concepts of String Manipulation, 1050 possible points; Formatting Strings, 1050 possible points; and Regular Expressions for Pattern Matching, 1400 possible points) from the Regular Expressions in Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 2625 points (75%) of the total possible points. Note that taking hints from the DataCamp chapter will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:

DataCamp.example.png
A sample screenshot for one of the DataCamp assignments. You'll want to combine yours into a single image or PDF if possible

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, February 17th to submit your assignment (right before the next lecture).


5.3.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.4.0 Resources


6.0.0 Appendix 1

6.1.0 End-of-Line Characters: Unix vs. Windows

End-of-line characters (EOL, a.k.a. newline, line ending, line feed, or line break) are the control character(s) that represent the end of one line and the start of a new one. Unix (Linux and Mac) uses a single linefeed character ("\n"), while Windows uses a carriage return followed by a linefeed ("\r\n", a.k.a. "CRLF"). You need to be careful when transferring files between Windows machines and Unix machines to make sure the line endings are translated properly. This is especially critical when you prepare scripts on a personal computer running Windows and then execute them on a server running Linux. Shell programs, in particular, will fail in mysterious ways if they contain DOS line endings. On the Unix side, tools such as dos2unix and unix2dos allow you to interconvert between EOL formats. On a Windows machine, the conversion can be done using the more command: TYPE input_filename | MORE /P > output_filename.
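The same conversion can also be sketched in Python itself by replacing the raw byte sequences (for real files, read with open(path, "rb") and write the converted bytes back out):

```python
# Convert Windows CRLF line endings to Unix LF on raw bytes.
windows_bytes = b"line one\r\nline two\r\n"
unix_bytes = windows_bytes.replace(b"\r\n", b"\n")
print(unix_bytes)  # b'line one\nline two\n'
```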


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png